Topic Modeling for Native Language Identification
Abstract
Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Several earlier studies have found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps because they capture characteristic errors of speakers of different native languages. This paper explores using Latent Dirichlet Allocation as a feature clustering technique over lexical features to see whether there is any evidence that these smaller-scale features cluster into more coherent latent factors, and investigates their effect in a classification task. We find that although (not unexpectedly) classification accuracy decreases, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
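The idea of using LDA as a feature clustering step can be sketched as follows. This is a minimal illustration and not the paper's implementation: the abstract does not state which toolkit, hyperparameters, or lexical features were used, so the collapsed Gibbs sampler, the values of `alpha` and `beta`, the function name `lda_topic_features`, and the toy documents are all assumptions made for the example. Each document's sparse word counts are replaced by a short vector of topic proportions, which would then be passed to a classifier in place of the raw lexical features.

```python
import random
from collections import Counter


def lda_topic_features(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling and return, for each document,
    its smoothed topic proportions (a dense vector of length n_topics)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical toy documents, each reduced to a list of function words.
docs = [
    "the of the a in the of".split(),
    "the a of in the the a".split(),
    "which that whom which that that".split(),
    "that which that whom which which".split(),
]
features = lda_topic_features(docs, n_topics=2)
```

Here `features` holds one two-dimensional vector per document instead of one count per vocabulary item. The reduction discards information, which is consistent with the drop in classification accuracy the abstract reports, but it keeps the feature space small as the underlying vocabulary grows.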
Similar resources
Feature Extraction for Native Language Identification Using Language Modeling
This paper reports on the task of Native Language Identification (NLI). We developed a machine learning system to identify the native language of authors of English texts written by non-native English speakers. Our system is based on the language modeling approach and employs cross-entropy scores as features for supervised learning, which leads to a significantly reduced feature space. Our metho...
A Comparative Study of Lexical Bundles in Soft Science Articles Written by Native and Iranian Authors
Writing academic texts by novice researchers requires a framework and support by learning how to cite the works of others. However, compared to the studies on other academic writings, studying citations by considering certainty markers has received little attention. The main purpose of this study was to investigate the shifts of certainty markers (hedges and boosters) in pre- and post-citation ...
Measuring Interlanguage: Native Language Identification with L1-influence Metrics
The task of native language (L1) identification suffers from a relative paucity of useful training corpora, and standard within-corpus evaluation is often problematic due to topic bias. In this paper, we introduce a method for L1 identification in second language (L2) texts that relies only on much more plentiful L1 data, rather than the L2 texts that are traditionally used for training. In par...
Robust, Lexicalized Native Language Identification
Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because these are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided due to the within-corpus topic confounds, and provide a detailed evaluation of ...
Can characters reveal your native language? A language-independent approach to native language identification
A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection,...